CPSC 545/445 (Autumn 2003) - Class 6: Pairwise Sequence Alignment
Module 2, Part 2

---
2.4 psa with affine gap scores

reminder: affine gap scores: gamma(g) = -d - (g-1)*e

here: global psa (local psa is analogous)


general approach:

modify the standard global psa (needleman-wunsch) algorithm as follows:

  M(i,j) :=  max {
            M(i-1,j-1) + s(xi,yj)
            M(k,j) + gamma(i-k)    k := 0...i-1  
            M(i,k) + gamma(j-k)    k := 0...j-1 }

problem: additional iteration over k (= position of last non-gap character)
  -> O(m*n*max{m,n}) time; in practice, this is typically too slow

but note: this works for arbitrary gap scores
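
a minimal sketch of the general recurrence above, to make the extra loop over k
concrete. assumptions (not from the notes): the function name align_general_gap,
the scoring function s and the gap function gamma are passed in as callables,
only the score is returned (no traceback), and M(0,0) = 0.

```python
def align_general_gap(x, y, s, gamma):
    """Global alignment score for an arbitrary gap score function gamma(g).

    s(a, b) is the substitution score; gamma(g) is the (negative) score
    of a gap of length g >= 1. Runs in O(m*n*max(m, n)) time.
    """
    m, n = len(x), len(y)
    neg_inf = float("-inf")
    M = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0  # assumption: empty prefixes align with score 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                # x_i aligned to y_j
                best = M[i - 1][j - 1] + s(x[i - 1], y[j - 1])
            for k in range(i):
                # x_{k+1}..x_i aligned to one gap of length i-k
                best = max(best, M[k][j] + gamma(i - k))
            for k in range(j):
                # y_{k+1}..y_j aligned to one gap of length j-k
                best = max(best, M[i][k] + gamma(j - k))
            M[i][j] = best
    return M[m][n]


def match_score(a, b):
    # hypothetical toy scoring: +1 match, -1 mismatch
    return 1 if a == b else -1


def affine(g, d=2, e=1):
    # affine gap score gamma(g) = -d - (g-1)*e
    return -d - (g - 1) * e


print(align_general_gap("ACGT", "AT", match_score, affine))
```

the two inner loops over k are what push the running time from O(m*n) to
O(m*n*max{m,n}).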


better: exploit additive structure of affine gap score
(this will allow us to achieve O(m*n) time complexity)

idea: use a help structure to capture information on start of gaps
	and accumulated gap penalties

-> three cases are pursued simultaneously: 
	"no gap" - best score M(i,j), 
	"(extending) gap in y" = "insertion in x" - best score Ix(i,j)
	"(extending) gap in x" = "insertion in y" - best score Iy(i,j)

  M(i,j) :=  max {
      	M(i-1, j-1) + s(xi,yj)
      	Ix(i-1, j-1) + s(xi,yj)		% end gap in y
      	Iy(i-1, j-1) + s(xi,yj)		% end gap in x
	}

  Ix(i,j) := max {
	M(i-1,j) -d        % start new gap in y
	Ix(i-1,j) -e       % extend existing gap in y
	}

  Iy(i,j) := max {
     	M(i, j-1) -d       % start new gap in x
      	Iy(i, j-1) -e      % extend existing gap in x
	}


initialization: M(0,0) = 0; Ix(k,0) = Iy(0,k) = gamma(k) = -d-(k-1)*e for k >= 1;
	all other entries in row 0 and column 0 are -infinity

the recurrences assume that a deletion is never directly followed by an insertion
(true for an optimal alignment if -d-e < min s(a,b) over all a,b)
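
a minimal sketch of the three-matrix recurrences for M, Ix and Iy, returning only
the optimal score. assumptions of this sketch (not stated in the notes): the
function name align_affine, the boundary values M(0,0) = 0 and
Ix(k,0) = Iy(0,k) = gamma(k) with every other boundary entry set to -infinity,
and the final score taken as the max over M(m,n), Ix(m,n), Iy(m,n).

```python
def align_affine(x, y, s, d, e):
    """Global alignment score with affine gaps gamma(g) = -d - (g-1)*e.

    M[i][j]:  best score with x_i aligned to y_j
    Ix[i][j]: best score with x_i aligned to a gap (gap in y)
    Iy[i][j]: best score with y_j aligned to a gap (gap in x)
    Runs in O(m*n) time.
    """
    m, n = len(x), len(y)
    neg_inf = float("-inf")
    M = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    Ix = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    Iy = [[neg_inf] * (n + 1) for _ in range(m + 1)]

    # boundary values (assumption of this sketch, see lead-in)
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = -d - (i - 1) * e  # x_1..x_i aligned to one gap
    for j in range(1, n + 1):
        Iy[0][j] = -d - (j - 1) * e  # y_1..y_j aligned to one gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = s(x[i - 1], y[j - 1])
            # match/mismatch, possibly closing a gap
            M[i][j] = max(M[i - 1][j - 1],
                          Ix[i - 1][j - 1],
                          Iy[i - 1][j - 1]) + sub
            # start a new gap in y, or extend an existing one
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            # start a new gap in x, or extend an existing one
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)

    return max(M[m][n], Ix[m][n], Iy[m][n])


def match_score(a, b):
    # hypothetical toy scoring: +1 match, -1 mismatch
    return 1 if a == b else -1


print(align_affine("ACGT", "AT", match_score, d=2, e=1))
```

with this toy scoring, -d-e = -3 < -1 = min s(a,b), so the condition above holds
and the score should agree with the general O(m*n*max{m,n}) recurrence.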


---
2.5 other sequence alignment problems:

- local alignments with repeated matches 
  useful for finding multiple genes (e.g., tRNA genes) 

- alignments with overlap matches 
  special case of local alignment 
  can be solved using smith-waterman algorithm,
  but there is a better algorithm (simple variation of needleman-wunsch)

- alignment with hybrid (complex) match conditions 

these can all be solved using variants of dynamic programming,
  time/space complexity O(m*n)
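
as one illustration, the overlap-match variation of needleman-wunsch can be
sketched as follows: gaps before the start and after the end of the overlap are
free, so the first row and column are initialised to 0 and the result is the
maximum over the last row and last column. assumptions of this sketch: a linear
gap penalty d (to keep it short), the function name overlap_score, and score
only (no traceback).

```python
def overlap_score(x, y, s, d):
    """Best overlap-match score between x and y with linear gap penalty d.

    Overhanging ends are not penalised: row 0 and column 0 stay at 0,
    and the alignment may end anywhere in the last row or last column.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,   # x_i aligned to a gap
                          F[i][j - 1] - d)   # y_j aligned to a gap
    last_row = max(F[m])
    last_col = max(row[n] for row in F)
    return max(last_row, last_col)


def match_score(a, b):
    # hypothetical toy scoring: +1 match, -1 mismatch
    return 1 if a == b else -1


# suffix "ACGT" of x overlaps prefix "ACGT" of y
print(overlap_score("GGGACGT", "ACGTTTT", match_score, d=2))
```

unlike smith-waterman, interior cells are not clamped at 0, so the alignment must
run from one border of the matrix to the opposite border.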

-> BSA, Ch.2; reading assignment 


- heuristic alignment -> 2.6
- multiple sequence alignment -> 2.7


---
2.6 Heuristic alignment

idea: reduce complexity of psa algorithms
	by using heuristic method
	-> lose guarantee for finding optimal alignment (w.r.t. given scoring model)

[why may this not be a big problem in practice?]

Here: BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990)

goal: find high-scoring local alignment between query sequence y
	(e.g., specific protein sequence)
	and long target sequence x / database (e.g., genome, chromosome)

basic idea: 
- a true match alignment likely contains very similar or identical subsequences
- find such subsequences and use them as seeds from which to extend
	longer, weaker but still good alignment
- short seeds allow use of a table of all possible seeds with starting points
	(index) constructed from query during preprocessing


realisation in BLAST:
- parameters w = seed length; t, C = score thresholds (seed / extension)
	(prot: w=3, dna: w=11)

- find all length-w words that align to some length-w substring of query y
  with score > t
  -> seed table S

- scan through sequence x (or database),
  for each match of subsequence in x with seed in S ("hit"):
    extend ungapped local alignment in both directions
    until alignment score drops more than C below the best score found for a shorter extension

(implementation uses additional heuristics, special techniques to increase
efficiency)
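
a toy sketch of the seed-and-extend idea described above. this is not the actual
BLAST implementation: the function names (build_seed_table, extend_hit,
seed_and_extend) are hypothetical, the seed table is built by brute-force
enumeration of all length-w words (only feasible for small w and alphabets), and
none of the additional heuristics are included.

```python
from itertools import product


def build_seed_table(y, w, t, s, alphabet):
    """Map each length-w word scoring > t against some w-mer of query y
    to the list of query positions where it does so."""
    table = {}
    for j in range(len(y) - w + 1):
        q = y[j:j + w]
        for word in product(alphabet, repeat=w):
            if sum(s(a, b) for a, b in zip(word, q)) > t:
                table.setdefault("".join(word), []).append(j)
    return table


def extend_hit(x, y, i, j, w, s, C):
    """Ungapped extension of the hit x[i:i+w] / y[j:j+w] in both directions.
    Stops when the running score falls more than C below the best so far.
    Returns (score, x_start, x_end) of the best extension found."""
    best = sum(s(a, b) for a, b in zip(x[i:i + w], y[j:j + w]))
    cur, best_r, k = best, 0, 0
    while i + w + k < len(x) and j + w + k < len(y):  # extend to the right
        cur += s(x[i + w + k], y[j + w + k])
        k += 1
        if cur > best:
            best, best_r = cur, k
        elif best - cur > C:
            break
    cur, best_l, k = best, 0, 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:          # extend to the left
        cur += s(x[i - 1 - k], y[j - 1 - k])
        k += 1
        if cur > best:
            best, best_l = cur, k
        elif best - cur > C:
            break
    return best, i - best_l, i + w + best_r


def seed_and_extend(x, y, w, t, C, s, alphabet):
    """Scan target x for seed hits and return the best ungapped extension,
    or None if there is no hit."""
    table = build_seed_table(y, w, t, s, alphabet)
    results = []
    for i in range(len(x) - w + 1):
        for j in table.get(x[i:i + w], []):
            results.append(extend_hit(x, y, i, j, w, s, C))
    return max(results, key=lambda r: r[0]) if results else None


def match_score(a, b):
    # hypothetical toy scoring: +1 match, -1 mismatch
    return 1 if a == b else -1


print(seed_and_extend("TTTTACGTACGGTTTT", "ACGTACGG",
                      w=3, t=2, C=2, s=match_score, alphabet="ACGT"))
```

with w=3 and t=2 under this toy scoring, only exact 3-mers of the query qualify as
seeds; lowering t would also admit similar (non-identical) words.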

different versions of BLAST for DNA and protein alignments;
various improvements of basic algorithm, including:
- gapped BLAST (for gapped alignment - uses dynamic programming to extend hits), 
- PSI-BLAST (position-specific iterated BLAST, often detects weaker seq similarities)

other heuristic alignment algorithms: FASTA (see BSA, 2.5)


---
Resources:

BSA, Ch.2 [good introduction; contains further references]